MetaboDynamics 0.99.0
Here we want to show a complete workflow to analyze concentration tables. As example we have a data set of irradiated cancer cells lines that were observed over four timepoints.
This package was built to facilitate the analysis of longitudinal metabolomics data. Most tools only allow the comparison between two time points or experimental conditions and are using frequentist statistical methods.
Here we want to show a complete workflow to analyze concentration tables.
library(MetaboDynamics)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
data("data_sim")
ggplot(data_sim, aes(x = measurement)) +
geom_density() +
theme_bw() +
facet_grid(cols = vars(time), rows = vars(condition)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ggtitle("raw data", "raw measurements")
We have a simulated data set of 98 metabolites with three measurement replicates
at four time points (1-4) across 3 experimental conditions (A-B).
In the first step in this workflow we estimate the dynamics of every single
metabolite at every experimental condition (here: radiation dose).
As metabolomics data is often noisy and we generally have few replicates due
to high costs, a robust method is needed for the estimation of mean
concentrations at every time point. For this we employ a Bayesian hierarchical
model that assumes normal distributions of log-transformed metabolite
concentrations.
The next plot shows the raw dynamics of single metabolites.
# we standardize to a mean of zero and a standard deviation of one of log-transformed data
ggplot(data_sim, aes(x = log_m)) +
geom_density() +
theme_bw() +
facet_grid(cols = vars(time), rows = vars(condition)) +
ggtitle("data", "log-transformed values")
The raw data is not distributed normally. So let’s log-transform the values. In the integrated simulated dataset this is already done in the column “log_m”.
ggplot(data_sim) +
geom_line(aes(x = time, y = log_m, col = metabolite)) +
theme_bw() +
xlab("timepoint") +
theme(legend.position = "none") +
facet_grid(rows = vars(condition)) +
ggtitle("raw metabolite dynamics", "color=metabolite")
We define dynamics as deviations at the observed time points from the
metabolite’s mean concentration over time.
As the raw concentrations of metabolites can differ by orders of magnitudes from
each other, and we want to be able to compare dynamics of metabolites with each
other, we standardize each metabolite at each radiation dose to a mean of zero
and a standard deviation of one. In the simulated data set the scaled measurements
are in column “m_scaled”.
ggplot(data_sim) +
geom_line(aes(
x = time,
y = m_scaled, col = metabolite
)) +
theme_bw() +
xlab("timepoint") +
theme(legend.position = "none") +
facet_grid(rows = vars(condition)) +
ggtitle("standardized dynamics", "color=metabolite")
Now we can finally model the dynamics. This might take up to 10 minutes per experimental condition.
We employ a Bayesian hierarchical model with con= metabolite concentrations, m= metabolite, c= experimental condition and t= time point ID:
\[\begin{align*} \log(con_{m,c,t})&\sim {\sf normal}(\mu_{m,c,t},\sigma_{m,c,t}) \\ \mu_{m,c,t}&\sim {\sf normal}(0,2) \\ \sigma_{m,c,t}&\sim {\sf exponential}(\lambda_{m,c}) \\ \lambda_{m,c}&\sim {\sf exponential}(2) \end{align*}\]
The code below shows how to fit the model and how to extract the diagnostic criteria from the model fits.
# fit model
fits_dynamics <- fit_dynamics_model(
data = data_sim, scaled_measurement = "m_scaled", time = "time",
condition = "condition", max_treedepth = 14,
adapt_delta = 0.999, iter = 4000, cores = 7, chains = 4
)
# extract diagnostics
diagnostics_dynamics <- extract_diagnostics_dynamics(
data = data_sim, iter = 4000, # make sure iter is the same as used for fitting
# the dynamic model
scaled_measurement = "m_scaled",
fits = fits_dynamics, chains = 4
)
diagnostics_dynamics[["plot_divergences"]]
diagnostics_dynamics[["plot_treedepth_error"]]
diagnostics_dynamics[["plot_rhat"]]
diagnostics_dynamics[["plot_neff"]]
# PPCs can be accessed with
diagnostics_dynamics[["plot_PCC_A"]]
## Warning: Removed 296539 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
diagnostics_dynamics[["plot_PCC_B"]]
## Warning: Removed 285164 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
diagnostics_dynamics[["plot_PCC_C"]]
## Warning: Removed 293666 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
This returns a list of model fits that are named by the experimental condition (“A”,“B”,“C”). With extract_diagnostics_dynamics() we can extract all the diagnostic criteria of Bayesian models (rhat,neff,divergences,max_treedept) and visualize them. Additionally dataframes for visual Posterior predictive checks (PPC) are prepared and Plots generated for the PPCs and diagnostic criteria.
After checking the diagnostic criteria and the PPC we can extract the estimates:
# #extract estimates
estimates_dynamics <- extract_estimates_dynamics(
condition = "condition",
data = data_sim, fits = fits_dynamics, samples = 1,
iter = 4000
)
## Warning in cbind(metabolite = unique(data$metabolite), metabolite.ID =
## as.numeric(as.factor(unique(data$metabolite))), : number of rows of result is
## not a multiple of vector length (arg 3)
## Warning in cbind(metabolite = unique(data$metabolite), metabolite.ID =
## as.numeric(as.factor(unique(data$metabolite))), : number of rows of result is
## not a multiple of vector length (arg 3)
## Warning in cbind(metabolite = unique(data$metabolite), metabolite.ID =
## as.numeric(as.factor(unique(data$metabolite))), : number of rows of result is
## not a multiple of vector length (arg 3)
We get two major outputs: 1) the estimation of concentration differences between two subsequent timepoints of each metabolite at each experimental condition 2) the dynamic profiles of each metabolites at each experimental condition
# 1) the differences between two timepoints
estimates_dynamics[["plot_timepoint_differences"]]
If the 95% highest density interval of the posterior does not include zero
we can confidently stat that there is a difference in mean concentrations
between two time points. If the 95% HDI is below zero we have a decrease in
concentrations between the two time points, if it is above zero we have an
increase in concentrations between time points.
# 2) dynamic profiles
estimates_dynamics[["plot_dynamics"]]
So we now have dynamic profiles of the metabolites at each radiation dose.
What do we do with this?
We could cluster these dynamics vectors, that we obtained (estimates_dynamics[,c(“mu1.mean”:“mut.mean)]) to see if we find groups of metabolites that
have similar dynamics.
For the sake of demonstration we only show a rudimentary hierachical clustering with the number of optimal clusters being the number of groups we used for simulating the data (8). In a real life example optimal number of clusters can be determined by optimal clustering criteria such as Gap statistics and average silhouette.
# get distances between vectors
dd_A <- dist(
estimates_dynamics[["A"]][, c(
"mu1_mean", "mu2_mean",
"mu3_mean", "mu4_mean"
)],
method = "euclidean"
)
# hierachical clustering
clust <- hclust(dd_A, method = "ward.D2")
clust_cut <- cutree(clust, k = 8)
# assing cluster ID to estimates
clust_A <- estimates_dynamics[["A"]][, c(
"metabolite", "condition", "mu1_mean", "mu2_mean",
"mu3_mean", "mu4_mean"
)]
clust_A$cluster <- clust_cut
rm(dd_A, clust, clust_cut)
# get distances between vectors
dd_B <- dist(
estimates_dynamics[["B"]][, c(
"mu1_mean", "mu2_mean",
"mu3_mean", "mu4_mean"
)],
method = "euclidean"
)
# hierachical clustering
clust <- hclust(dd_B, method = "ward.D2")
clust_cut <- cutree(clust, k = 8)
# assing cluster ID to estimates
clust_B <- estimates_dynamics[["B"]][, c(
"metabolite", "condition", "mu1_mean", "mu2_mean",
"mu3_mean", "mu4_mean"
)]
clust_B$cluster <- clust_cut
rm(dd_B, clust, clust_cut)
# get distances between vectors
dd_C <- dist(
estimates_dynamics[["C"]][, c(
"mu1_mean", "mu2_mean",
"mu3_mean", "mu4_mean"
)],
method = "euclidean"
)
# hierachical clustering
clust <- hclust(dd_C, method = "ward.D2")
clust_cut <- cutree(clust, k = 8)
# assing cluster ID to estimates
clust_C <- estimates_dynamics[["C"]][, c(
"metabolite", "condition", "mu1_mean", "mu2_mean",
"mu3_mean", "mu4_mean"
)]
clust_C$cluster <- clust_cut
rm(dd_C, clust, clust_cut)
cluster <- rbind(clust_A, clust_B, clust_C)
rm(clust_A, clust_B, clust_C)
We combine all clustering results in one dataframe that hold columns “metabolite”, “condition”, “mu1-t.mean” and “cluster”. “Cluster” refers to the cluster ID of the metabolite.
temp <- cluster
temp <- temp %>% pivot_longer(
cols = c(mu1_mean, mu2_mean, mu3_mean, mu4_mean),
names_to = "timepoint", values_to = "mu_mean"
)
ggplot(temp, aes(
x = as.factor(as.numeric(as.factor(timepoint))),
y = mu_mean, group = metabolite
)) +
geom_line() +
xlab("timepoint") +
ylab("estimated mean concentration") +
theme_bw() +
theme(legend.position = "none") +
facet_grid(rows = vars(condition), cols = vars(cluster)) +
ggtitle("clustered dynamics", "panels=cluster ID")
rm(temp)
As we can see metabolites show different dynamics in different experimental conditions. Can we quantify the biological function of these dynamics clusters?
To quantify the possible biological function of these dynamics clusters we retrieved from the KEGG-database the following information (with package KEGGREST): 1) to which functional modules our experimental metabolites are annotated and 2) which metabolites are annotated to functional modules in general.
The functional modules of the KEGG-database are organised in three hierarchies: upper, middle and lower. Here we will do functional analysis on the middle hierarchy. To facilitate analysis the data frames “metabolite_modules”, which holds the information about experimental metabolites, and “modules_compounds”, which holds the information about which metabolites are in general annotated to functional modules, were prepared. We load both data sets and can inspect the documentation.
data("metabolite_modules")
help("metabolite_modules")
head(metabolite_modules)
## # A tibble: 6 × 8
## ...1 metabolite KEGG module_id module_name upper_hierarchy middle_hierarchy
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1-Aminocyc… C012… M00368 Ethylene b… Pathway modules Amino acid meta…
## 2 2 2-Aminomuc… C022… M00038 Tryptophan… Pathway modules Amino acid meta…
## 3 3 2-Phosphog… C006… M00001 Glycolysis… Pathway modules Carbohydrate me…
## 4 4 2-Phosphog… C006… M00002 Glycolysis… Pathway modules Carbohydrate me…
## 5 5 2-Phosphog… C006… M00003 Gluconeoge… Pathway modules Carbohydrate me…
## 6 6 2-Phosphog… C006… M00346 Formaldehy… Pathway modules Energy metaboli…
## # ℹ 1 more variable: lower_hierarchy <chr>
data("modules_compounds")
help("modules_compounds")
head(modules_compounds)
## # A tibble: 6 × 6
## ...1 module_id kegg_id upper_hierarchy middle_hierarchy lower_hierarchy
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2 M00001 C00267 Pathway modules Carbohydrate metaboli… Central carboh…
## 2 3 M00001 C00668 Pathway modules Carbohydrate metaboli… Central carboh…
## 3 4 M00001 C05345 Pathway modules Carbohydrate metaboli… Central carboh…
## 4 5 M00001 C05378 Pathway modules Carbohydrate metaboli… Central carboh…
## 5 6 M00001 C00111 Pathway modules Carbohydrate metaboli… Central carboh…
## 6 7 M00001 C00118 Pathway modules Carbohydrate metaboli… Central carboh…
We can also retrieve the necessary dataframes with:
# ORA_dataframes <- get_ORA_dataframes(data = data_sim, kegg = "KEGG",
# metabolite_name = "metabolite")
# metabolite_modules <- ORA_dataframes[["annotation"]]
# modules_compounds <- ORA_dataframes[["background"]]
Here we have to keep in mind that not all KEGG modules are suitable for testing on every observed organism and experimental condition. For example the modules “Xenobiotics biodegradation”,“Biosynthesis of other secondary metabolites” and “Biosynthesis of terpenoids and polyketides” should not be analyzed in a human lung cancer cell line.
# modules_compounds <- modules_compounds[-which(modules_compounds$middle_hierachy=="Xenobiotics biodegradation"),]
# modules_compounds <- modules_compounds[-which(modules_compounds$middle_hierachy=="Biosynthesis of other secondary metabolites"),]
# modules_compounds <- modules_compounds[-which(modules_compounds$middle_hierachy=="Biosynthesis of terpenoids and polyketides"),]
# metabolite_modules <- metabolite_modules[-which(metabolite_modules$middle_hierachy=="Xenobiotics biodegradation"),]
# metabolite_modules <- metabolite_modules[-which(metabolite_modules$middle_hierachy=="Biosynthesis of other secondary metabolites"),]
# metabolite_modules <- metabolite_modules[-which(metabolite_modules$middle_hierachy=="Biosynthesis of terpenoids and polyketides"),]
For the functional analysis we employ a hypergeometric model. We consider a functional module as over-expressed in a cluster if the 95% inter-quantile range (ICR) of the log-transformed probabilities of OvEs lies above zero. OvE refers to the ratio of observed metabolites in a cluster being mapped to a functional module over the number of expected metabolites in a cluster being mapped to a module under the assumption of a hypergeometric distribution (=drawing without replacement). We apply the functional analysis to the middle and lower hierarchy of functional modules.
data("cluster")
ORA <- ORA_hypergeometric(background = modules_compounds, annotations = metabolite_modules, clusters = cluster, tested_column = "middle_hierarchy")
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
ORA[["plot_ORA"]]
ORA_lower <- ORA_hypergeometric(background = modules_compounds, annotations = metabolite_modules, clusters = cluster[cluster$condition == "A", ], tested_column = "lower_hierarchy")
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
ORA_lower[["plot_ORA"]]
Great, we can now see which functional module is over- (green points and error-bars) or under-represented (none in this example) in which dynamics cluster! For instance in cluster 3 at condition A and C the modules “Energy metabolism” and “Carbohydrate metabolism” are over-represented.
We can not only do over-representation analysis of KEGG-functional modules but also compare dynamics clusters across different experimental conditions. For this we employ a Bayesian model that estimates the mean difference as well as the standard deviation of differences between dynamics clusters.
dist= vector of pairwise euclidean distances between each dynamic vector of cluster a and every dynamic vector of cluster b, ID= cluster pair ID \[\begin{align*} dist_{ID}&\sim {\sf normal}(\mu_{ID},\sigma_{ID}) \\ \mu_{ID}&\sim {\sf normal^+}(0,2) \\ \sigma_{ID}&\sim {\sf exponential}(1) \end{align*}\]
comparison_dynamics <- compare_dynamics(clusters = cluster, dynamics = c("mu1_mean", "mu2_mean", "mu3_mean", "mu4_mean"))
comparison_dynamics[["plot_dynamic_comparison"]]
The bigger and brighter a point is the smaller is the mean distance between
dynamic clusters and the smaller is the standard deviation. That means big bright
points indicate high dynamic similarity which small spread. Here B_8 and A_4
have high similiarity in dynamics.
comparison_metabolites <- compare_metabolites(clusters = cluster)
comparison_metabolites[["plot_metabolite_comparison"]]
We have two clusters that are very similar in their metabolite composition:
C_6 and A_5. If we compare that to the dynamic profiles and ORA analysis
we see that similar functional modules are over-expressed as expected BUT
the dynamics differ between the two radiation doses.
Can we facilitate visualization?
dynamics <- comparison_dynamics[["estimates"]]
metabolites <- comparison_metabolites[["Jaccard"]]
temp <- left_join(dynamics, metabolites, by = c("cluster_a", "cluster_b"))
ggplot(temp, aes(x = 1 / mu_mean, y = Jaccard)) +
geom_point(aes(col = 1 / sigma_mean)) +
theme_bw() +
scale_color_viridis_c() +
xlab("dynamics similarity") +
ylab("metabolite similarity") +
ggtitle("comparison of clusters")
x <- unique(temp[, "cluster_a"])
temp <- temp %>% mutate(scale_Jaccard = scale(Jaccard))
ggplot(temp, aes(x = cluster_b, y = cluster_a)) +
geom_point(aes(size = Jaccard, col = mu_mean)) +
theme_bw() +
scale_color_viridis_c(option = "magma") +
scale_x_discrete(limits = x) +
xlab("") +
ylab("") +
scale_y_discrete(limits = x) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(col = "dynamic distance", size = "metabolite similarity") +
ggtitle("comparison of clusters")
## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).
We can find two cluster pairs that are pretty similar in regards to their
composing metabolites but dissimilar in regards to their dynamics. Their ORA
profiles are quit similar as expected from the similar metabolite compositions but
they show different dynamics between experimental conditions: B_7 and A_4
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Berlin
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.3.1 dplyr_1.1.4 ggplot2_3.5.1
## [4] MetaboDynamics_0.99.0 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.44.1 gtable_0.3.5 xfun_0.47
## [4] bslib_0.8.0 QuickJSR_1.3.1 inline_0.3.19
## [7] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
## [10] stats4_4.4.1 parallel_4.4.1 tibble_3.2.1
## [13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
## [16] S4Vectors_0.42.1 RcppParallel_5.1.9 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.12 farver_2.1.2 compiler_4.4.1
## [22] stringr_1.5.1 Biostrings_2.72.1 tinytex_0.52
## [25] munsell_0.5.1 codetools_0.2-18 GenomeInfoDb_1.40.1
## [28] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
## [31] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
## [34] cachem_1.1.0 StanHeaders_2.32.10 rstan_2.32.6
## [37] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
## [40] purrr_1.0.2 bookdown_0.40 labeling_0.4.3
## [43] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
## [46] cli_3.6.3 magrittr_2.0.3 loo_2.8.0
## [49] pkgbuild_1.4.4 utf8_1.2.4 withr_3.0.1
## [52] scales_1.3.0 UCSC.utils_1.0.0 rmarkdown_2.28
## [55] XVector_0.44.0 httr_1.4.7 matrixStats_1.4.1
## [58] gridExtra_2.3 png_0.1-8 evaluate_0.24.0
## [61] knitr_1.48 IRanges_2.38.1 viridisLite_0.4.2
## [64] rstantools_2.4.0 rlang_1.1.4 Rcpp_1.0.13
## [67] glue_1.7.0 BiocManager_1.30.25 BiocGenerics_0.50.0
## [70] rstudioapi_0.16.0 jsonlite_1.8.8 R6_2.5.1
## [73] zlibbioc_1.50.0